Method and apparatus for spam message detection

ABSTRACT

A method, apparatus and computer program product for spam message detection. The method includes collecting time domain transmission characteristic of a message source; computing frequency domain transmission characteristic of the message source with the time domain transmission characteristic of the message source; and identifying the message source to be a spammer in response to the frequency domain transmission characteristic of the message source satisfying predefined criteria; wherein the steps of the method are carried out using a computer device. An apparatus and computer program product for carrying out the above method is also provided.

CROSS REFERENCE TO RELATED APPLICATION

This application claims priority under 35 U.S.C. §119 from Chinese Patent Application No. 200910139811.9 filed Jun. 30, 2009, the entire contents of which are incorporated herein by reference.

This application is a Continuation application of allowed co-pending U.S. patent application Ser. No. 12/821,230 filed on Jun. 23, 2010, incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the field of information processing, and more particularly, to a method and apparatus for spam message detection.

2. Description of the Related Art

Spam messages and spam mail affect user experience and system performance. There exist a variety of approaches for detecting spam messages. One such approach is a user feedback based approach, where a user identifies and reports a spammer. According to another approach, a social network based approach, a social network archive is established for each user, and a message transmitted by the user to other users outside of the social network is determined to be a spam message. A relatively large data record system is required to store the reported spammers or the social network archives, and such a data record system needs to be shared among various service operators, which complicates the feasibility of using these systems across various service operators.

According to a message content based approach, a message will be determined to be a spam message if it contains a preset keyword. In this approach, an excessively small set of keywords will cause a high false negative rate, while an excessively large set of keywords will affect detection speed. This approach may also lead to privacy concerns since it checks message content. In addition, the spammer can escape detection in a simple, flexible manner, such as by inserting a space within a keyword.

According to an approach based on message transmission speed, a message source can be determined to be a spammer if it transmits bulk messages or repeated messages in a short span of time. However, the spammer can reduce the number of messages transmitted by each message source within the short span of time by making multiple message sources transmit messages in turns, while a normal user may transmit bulk messages in a short span of time under some circumstances.

SUMMARY OF THE INVENTION

Embodiments of the invention propose a method for detecting spam messages such that a spammer cannot escape detection through the above-mentioned simple means.

According to an embodiment of the invention, a method for spam message detection is presented. The method includes:

collecting a time domain transmission characteristic of a message source;

computing a frequency domain transmission characteristic of the message source using the time domain transmission characteristic of the message source; and

identifying the message source as a spammer in response to the frequency domain transmission characteristic of the message source meeting a predefined condition;

wherein the steps of the method are carried out using a computer device.

According to another embodiment of the invention, an apparatus for detecting a spam message is presented. The apparatus includes:

a collection means configured to collect a time domain transmission characteristic of a message source;

a computation means configured to compute a frequency domain transmission characteristic of the message source using the time domain transmission characteristic of the message source; and

an identification means configured to identify the message source as a spammer in response to the frequency domain transmission characteristic of the message source meeting a predefined condition.

According to yet another embodiment of the invention, a computer program product for detecting a spam message is presented. The computer program product includes:

a computer readable storage medium having computer readable program code. The computer readable program code includes computer readable program code configured to execute the above method. The method includes:

collecting a time domain transmission characteristic of a message source;

computing a frequency domain transmission characteristic of the message source using the time domain transmission characteristic of the message source; and

identifying the message source as a spammer in response to the frequency domain transmission characteristic of the message source meeting a predefined condition;

wherein the steps of the method are carried out using a computer device.

Therefore, in accordance with the embodiment of the invention, a spammer that makes multiple message sources transmit messages in turns can be detected through the frequency domain transmission characteristic, thereby compensating for or ameliorating the defects in the previously mentioned approaches.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1(A) to 1(C) are diagrams illustrating the frequency domain transmission characteristics for different types of message sources.

FIG. 2 is a block diagram of a method for spam message detection according to an embodiment of the invention.

FIGS. 3(A) to 3(D) show a model parameter distribution of known spammers under four different time domain sample intervals.

FIG. 4 shows a result of spam message detection according to an embodiment of the invention.

FIG. 5 is a flowchart of session detection according to an embodiment of the invention.

FIG. 6 is a block diagram of an apparatus for spam message detection according to an embodiment of the invention.

FIG. 7 is a detailed schematic of a data processing system used for implementing the exemplary embodiments of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of a method and apparatus for spam message detection provided by the invention will be described in detail below and should be read in conjunction with the accompanying drawings. When a first element is depicted as connected to a second element, the first element not only can be directly connected to the second element, but can also be indirectly connected to the second element through a third element. Further, for the sake of clarity, some elements that are unnecessary to fully understand embodiments of the present invention are omitted.

FIGS. 1(A) to 1(C) are diagrams of the frequency domain transmission characteristic for different types of message sources. FIG. 1(A) is a diagram of the frequency domain transmission characteristic of a message source that transmits messages in bursts. FIG. 1(B) is a diagram of the frequency domain transmission characteristic of a message source that transmits messages at periodic intervals. FIG. 1(C) is a diagram of the frequency domain transmission characteristic of a message source that transmits messages in a random manner.

The message source in FIG. 1(A) transmits bulk spam messages in a short time interval, thereby transmitting numerous spam messages before such behavior can be detected by the system. Such a spammer can be detected, as disclosed previously, by the approach based on message transmission speed.

The message source in FIG. 1(B) is one message source within a group of message sources. This group of message sources takes turns to transmit messages. For each message source within the group, the transmission speed does not reach a threshold, and therefore the message source will not be identified as a spammer by the approach based on message transmission speed. The message source in FIG. 1(C) corresponds to that of a normal user, which transmits messages in a random manner; thus its frequency domain transmission characteristic does not show any regularity.

As shown in FIGS. 1(A) to 1(C), the frequency domain transmission characteristics of different types of message sources have significant differences. Thus, the type to which each message source belongs can be determined from the corresponding frequency domain transmission characteristic, thereby determining whether the message source is a spammer.

FIG. 2 is a block diagram of a method for spam message detection according to an embodiment of the invention. As described above, the type of each message source can be determined based on its frequency domain transmission characteristic. Therefore, the method for spam message detection according to an embodiment of the invention includes:

At step 201, a time domain transmission characteristic of a message source is collected.

The time domain transmission characteristic of a message source can be obtained through a variety of channels. For example, the so called time domain transmission characteristic can actually be determined by the arrival time of the message. In other words, the network can identify only the time when a message arrives on the network side, and generally cannot identify the time when the message was transmitted from a message source. The arrival time of the message can be determined, for example, from a Call Detail Record (CDR) or various databases. It should be obvious to one skilled in the art that various other techniques may be used to determine the arrival time of the message, and all such methods and techniques fall within the scope of the embodiments of the invention.

At step 202, a frequency domain transmission characteristic of the message source is computed with the time domain transmission characteristic of the message source.

At step 203, it is determined whether the frequency domain transmission characteristic of the message source meets a predefined condition, and if so, the message source is identified as a spammer.

The predefined condition can take a variety of forms, for example that the frequency domain transmission characteristic matches that of a predefined spammer template, or that it does not match that of a predefined non-spammer template. In particular, the frequency domain transmission characteristic of a message source is generally represented by a set of parameters. The variance of this set of parameters can be computed. If the variance of these parameters is greater than a variance threshold, the message source may be considered to be a spammer. This is typically because a spammer cannot transmit messages in the same random manner as a normal user does, and thus its frequency domain transmission characteristic typically has relatively obvious peaks and valleys, which correspond to a relatively large variance of the parameters of the frequency domain transmission characteristic.

In contrast, the frequency domain transmission characteristic of an ordinary message source that transmits messages randomly is similar to white noise: its spectral distribution is relatively smooth, which corresponds to a relatively small variance of the parameters of the frequency domain transmission characteristic. Thus, by using the variance for preliminary filtering, the number of message sources for which parameter comparison is required is reduced, thereby increasing the processing speed.

Next, the implementation of steps 202 and 203 will be described in detail, especially in a case where the predefined condition is that the frequency domain transmission characteristic of a predefined spammer template is matched. According to one embodiment, when a frequency domain transmission characteristic of a message source is computed with the time domain transmission characteristic of the message source, the time domain transmission characteristic is transformed into a frequency domain transmission characteristic by using Fourier transformation.

Fourier transformation is a technique that is well known to a person skilled in the art and, for the sake of brevity, is not described in this document. After obtaining the frequency domain transmission characteristic by application of Fourier transformation, it can be determined whether the frequency domain transmission characteristic matches the frequency domain transmission characteristic of the predefined spammer template by comparing the frequency domain transmission characteristic of the message source with that of the predefined spammer template.
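By way of illustration only, the Fourier-based embodiment can be sketched as follows. The sketch assumes that the time domain transmission characteristic has already been reduced to a list of message counts per sample interval; the function name `dft_magnitudes` and the sample data are hypothetical and do not appear in the embodiments.

```python
import cmath

def dft_magnitudes(counts):
    """Magnitude spectrum of per-interval message counts (naive O(n^2) DFT)."""
    n = len(counts)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(counts)))
        for k in range(n)
    ]

# Hypothetical source that sends 5 messages in every 4th interval.
periodic = [5, 0, 0, 0] * 4
spectrum = dft_magnitudes(periodic)
# The energy concentrates at k = 0, 4, 8 and 12; the other bins are near zero.
```

A periodic source of this kind produces isolated peaks in the spectrum, which is the regularity that a comparison against a spammer template would look for.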

However, a drawback of obtaining the frequency domain transmission characteristic through Fourier transformation is that Fourier transformation depends on discrete sampling. Discrete sampling causes spectrum extension and spectrum aliasing, thereby introducing noise into the spectrum. Noise often overwhelms the desired frequency domain transmission characteristic, degrading accuracy. Overcoming the influence of noise to achieve the required accuracy requires increasing the order of the Fourier transformation, which leads to a corresponding increase in memory overhead for caching data for the time domain transmission characteristic and in computation overhead for performing the Fourier transformation and parameter comparison.

Therefore, in an embodiment of the invention, the frequency domain transmission characteristic of a message source is estimated by using a model. It is then determined whether the frequency domain transmission characteristic matches the frequency domain transmission characteristic of a predefined spammer template. If there is a match, the message source will be considered a spammer. According to a present embodiment of the invention, a model such as an Autoregressive (AR) model, an Autoregressive Moving Average (ARMA) model or a Moving Average (MA) model is established for the message source. Since there is no feedback of output to input in the system acting as a message source, the message source is preferably modeled as an Autoregressive (AR) model. The definition of an M-order Autoregressive model is:

$\begin{matrix}{{x(t)} = {{\sum\limits_{m = 1}^{M}{a_{m}{x\left( {t - m} \right)}}} + {ɛ(t)}}} & (1)\end{matrix}$

According to the model, the value of x at a current time point is a linear combination of the values of x at the past M time points plus white noise ε(t), whose average value is zero and whose variance is σ².

a₁ . . . a_(M) are M model parameters which constitute the model's parameter set. σ² is the model gain. Thus, estimating the frequency domain transmission characteristic of the message source amounts to estimating these model parameters and the model gain in an AR model for the message source. However, comparing a frequency domain transmission characteristic with a predefined spammer template implies comparing the corresponding model parameters, and the model gain σ² will not be compared. The following will explain why the model gain σ² is not compared. With this kind of method, the number of parameters to be compared can be set flexibly.

Next, estimating the model parameters in an AR model by using the time domain transmission characteristic of a message source will be described. For a message source, the number of messages transmitted within a time period is typically detected by using a sliding window. For an M-order AR model, the sliding window has M+1 panes, each corresponding to a time domain sample interval, and the length of the time domain sample interval is assumed to be P. The number of messages transmitted by the message source in any one of the time domain sample intervals can be easily determined. At time point 0, the value of each pane is zero, and at time point P, the number of messages transmitted by the message source in the time period from time point 0 to time point P is computed as the value of a first pane. At time point 2P, the number of messages transmitted by the message source in the time period from time point P to time point 2P is computed as the value of a second pane.

This process is continued until time point (M+1)P, where the number of messages transmitted by the message source in the time period from time point MP to time point (M+1)P is computed as the value of the (M+1)^(th) pane. Thereafter, at time point (M+2)P, the value of the first pane is set equal to that of the second pane, the value of the second pane is set equal to that of the third pane, and so on, until the value of the (M+1)^(th) pane is set equal to the number of messages transmitted by the message source in the time period from time point (M+1)P to time point (M+2)P.

Thus a sliding window is formed. Compared to the embodiment that utilizes Fourier transformation, the setting of the length of the time domain sample interval is more flexible. This is because the embodiment that utilizes Fourier transformation needs to collect each individual message, while the present embodiment only needs to collect statistics on the total number of messages in a given interval.
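The sliding window described above can be sketched as follows. This is a minimal illustration, assuming that message arrival times are available as numbers in the same unit as P and that each sample interval is half-open, i.e. [kP, (k+1)P); the function name `window_snapshots` and the sample arrival times are hypothetical.

```python
from collections import deque

def window_snapshots(arrival_times, interval_length, order):
    """Yield the M+1 pane values each time the sliding window is full."""
    if not arrival_times:
        return
    # Count messages per sample interval of length P (assumed half-open).
    n_intervals = int(max(arrival_times) // interval_length) + 1
    counts = [0] * n_intervals
    for t in arrival_times:
        counts[int(t // interval_length)] += 1
    # A deque with maxlen M+1 drops the oldest pane when a new one arrives.
    panes = deque(maxlen=order + 1)
    for count in counts:
        panes.append(count)
        if len(panes) == order + 1:
            yield list(panes)

# Hypothetical arrival times, with P = 4 and M = 2 (three panes).
snapshots = list(window_snapshots([0.5, 1.0, 4.2, 8.1, 8.9, 13.0], 4, 2))
# snapshots == [[2, 1, 2], [1, 2, 1]]
```

Only the per-interval totals are retained, which reflects the observation above that individual messages need not be stored.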

At time point (M+1)P, the autocorrelation of the values in the M+1 panes of the sliding window is computed, using:

$\begin{matrix}{{R(m)} = \frac{\sum\limits_{t = 1}^{M + 1 - m}{{x(t)}{x\left( {t + m} \right)}}}{M + 1 - m}} & (2)\end{matrix}$

where R(m) denotes the autocorrelation at lag m. The M+1 autocorrelation values can be computed from the values in the panes of the current sliding window. When the sliding window slides, the value of the first pane of the sliding window is discarded. The advantage is that computation over the pane values of the current sliding window is needed only at a time point at which autocorrelation is required; at any other time point, only the values of the panes in the sliding window need to be updated.
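Formula (2) can be illustrated with a short sketch that operates on the pane values of one sliding window. The function name `autocorrelation` and the pane values are hypothetical, chosen only for illustration.

```python
def autocorrelation(panes, m):
    """R(m) per Formula (2): mean of x(t) * x(t + m) over the overlapping pairs."""
    n = len(panes)  # n corresponds to M + 1
    return sum(panes[t] * panes[t + m] for t in range(n - m)) / (n - m)

# Hypothetical window with M = 3, i.e. four panes.
panes = [2, 1, 2, 1]
r = [autocorrelation(panes, m) for m in range(len(panes))]
# r == [2.5, 2.0, 2.5, 2.0]
```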

Autocorrelation can also be obtained from quasi-autocorrelation, which is first computed according to the following formulas at every interval P starting from time point (M+1)P:

${R^{\prime}(0)} = {\sum\limits_{t = 1}^{T}{x^{2}(t)}}$

${R^{\prime}(1)} = {\sum\limits_{t = 1}^{T - 1}{{x(t)}{x\left( {t + 1} \right)}}}$

…

${R^{\prime}(M)} = {\sum\limits_{t = 1}^{T - M}{{x(t)}{x\left( {t + M} \right)}}}$

where T is a natural number that is not less than M+1. The value of the corresponding autocorrelation is then computed according to the following formula:

$\begin{matrix}{{R(m)} = \frac{R^{\prime}(m)}{T - m}} & (3)\end{matrix}$

where m is an integer that is not less than 0 and not greater than M. The advantage here is that the number of messages transmitted in all time domain sample intervals starting from time point 0 can be considered for computation.
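One possible way of maintaining the quasi-autocorrelation sums incrementally, so that each new sample value updates R′(0) to R′(M) without revisiting earlier samples, is sketched below. The class name `QuasiAutocorrelation` and its methods are hypothetical, and T is assumed here to be the number of sample values seen so far.

```python
from collections import deque

class QuasiAutocorrelation:
    """Running sums R'(0)..R'(M); only the last M samples are kept in memory."""

    def __init__(self, order):
        self.order = order
        self.recent = deque(maxlen=order)  # the last M sample values
        self.sums = [0.0] * (order + 1)    # R'(0) .. R'(M)
        self.count = 0                     # T, the number of samples so far

    def add(self, x):
        self.sums[0] += x * x
        # recent[-m] is the sample m intervals before x, i.e. x(t) for x(t + m).
        for m in range(1, len(self.recent) + 1):
            self.sums[m] += self.recent[-m] * x
        self.recent.append(x)
        self.count += 1

    def autocorrelation(self, m):
        """R(m) per Formula (3); requires more than m samples."""
        return self.sums[m] / (self.count - m)

q = QuasiAutocorrelation(order=2)
for value in [2, 1, 2, 1]:
    q.add(value)
r = [q.autocorrelation(m) for m in range(3)]
# r == [2.5, 2.0, 2.5]
```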

Thus a total of M+1 values ranging from R(0) to R(M) can be computed. With these M+1 values and in conjunction with Formula (1) given above, M+1 equations as shown below can be formed, so that the M model parameters a₁ to a_(M) and the model gain σ² can be resolved.

$\begin{matrix}{{\begin{pmatrix}{R(0)} & {R(1)} & {R(2)} & \ldots & {R(M)} \\{R(1)} & {R(0)} & {R(1)} & \ldots & {R\left( {M - 1} \right)} \\{R(2)} & {R(1)} & {R(0)} & \ldots & {R\left( {M - 2} \right)} \\\vdots & \vdots & \vdots & \vdots & \vdots \\\vdots & \vdots & \vdots & \vdots & \vdots \\\vdots & \vdots & \vdots & \vdots & \vdots \\{R(M)} & {R\left( {M - 1} \right)} & {R\left( {M - 2} \right)} & \ldots & {R(0)}\end{pmatrix}\begin{pmatrix}1 \\a_{1} \\a_{2} \\\vdots \\\vdots \\\vdots \\a_{M}\end{pmatrix}} = \begin{pmatrix}\sigma^{2} \\0 \\0 \\0 \\0 \\0 \\0\end{pmatrix}} & (4)\end{matrix}$

It can be shown by a person skilled in the art that when a₁ to a_(M) are all real numbers, the transfer function of the system can be represented as:

$\begin{matrix}{{H(z)} = \frac{\sigma}{\sum\limits_{i = 1}^{M}{a_{i}z^{i}}}} & (5)\end{matrix}$

Using the method disclosed above, the frequency domain transmission characteristic of the message source can be determined once a₁ to a_(M) and σ² are estimated. For example, by comparing the computed a₁ to a_(M) of the message source with those of the spammer template, it can be determined whether the frequency domain transmission characteristic of the message source matches the predefined spammer template, thereby determining whether the message source is a spammer. As another example, it has been found experimentally that for two message sources having different time periods, if a first message source transmits more messages than a second message source at the arrival of each time period, then σ² of the first message source is larger than that of the second message source.

Since σ² can be estimated through R(0), R(0) can be taken as a criterion for preliminary filtering. If R(0) of a message source is larger than an average power threshold, then the message source is considered to be a spam message source. Only R(0) needs to be computed at preliminary filtering, so there is no need to compute R(1) to R(M), to resolve a₁ to a_(M), or to compare a₁ to a_(M). R(0) is often referred to as the signal's average power. R(0) can be computed either from the value of each pane in the current sliding window alone, or by computing the quasi-autocorrelation first and then applying Formula (3).

With the method according to embodiments of the present invention, the length of the time domain sample interval can be set flexibly. The number of parameters to be compared, i.e. M, can also be set flexibly. However, the frequency domain transmission characteristic of different spammers, such as the period used, may not be the same. If the length of the time domain sample interval is relatively long, then a spammer that uses a shorter time period cannot be captured; if the length of the time domain sample interval is relatively short, then capturing a spammer that uses a longer time period incurs overhead, because too many sample points are required.

Yet a further embodiment of the invention will be described below. According to the time domain sampling theorem, for a spectrum limited signal f(t) whose frequency is between 0 and f_(m), the signal f(t) can be uniquely represented without distortion by a series of time domain sampling values having equal interval only if the interval of time domain sampling is not greater than 1/(2f_(m)), that is, the sampling frequency is not lower than 2f_(m). Thus, if P is the length of the time domain sample interval used to perform time domain sampling, then that time domain sampling can, without distortion, represent a signal whose frequency is lower than 1/(2P). If the number of samples is N, then the sampled values only exist within a time range of 0 to NP. Therefore, signals whose time period is longer than NP cannot be represented by the time domain sampling. Thus, the range from 1/(NP) to 1/(2P) is the effective discrimination interval, with respect to the frequency domain characteristic, of time domain sampling whose time domain sample interval has length P and whose number of samples is N.

For example, if P takes the values P1=4, P2=16, P3=128 and P4=1024 (where the unit of measurement can be “second” or any suitable time unit), then the corresponding effective discrimination intervals are:

Interval1: 1/(4N) to 1/8,

Interval2: 1/(16N) to 1/32,

Interval3: 1/(128N) to 1/256,

Interval4: 1/(1024N) to 1/2048.

Where the number of samples N>2, the lengths of Interval1 to Interval4 decrease in turn. Higher frequency domain discrimination may be used in a shorter effective discrimination interval, and lower frequency domain discrimination may be used in a longer (higher frequency) effective discrimination interval. In other words, lower frequency domain discrimination is used in an effective discrimination interval at the high frequency end, and higher frequency domain discrimination is used in an effective discrimination interval at the low frequency end. According to the frequency domain sampling theorem, a time limited signal f(t) existing in the time range from 0 to t_(m) can be uniquely represented without distortion by values obtained by sampling that signal's spectrum F(f) at equal intervals, provided that the interval of frequency domain sampling is not greater than 1/t_(m). Thus, the discrimination of the frequency domain characteristic obtained by time domain sampling whose time domain sample interval has length P and whose number of samples is N can be no finer than 1/(NP). 1/(NP) can be used as the discrimination in the various effective discrimination intervals.
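The bounds of the effective discrimination interval can be evaluated with a short sketch. The function name `effective_discrimination_interval` is hypothetical, and N=11 is used only as an example value; exact fractions are used so that the results can be compared directly with the intervals listed above.

```python
from fractions import Fraction

def effective_discrimination_interval(interval_length, num_samples):
    """Return (1/(N*P), 1/(2*P)) for sample interval length P and N samples."""
    low = Fraction(1, num_samples * interval_length)
    high = Fraction(1, 2 * interval_length)
    return low, high

# Example values P1..P4 with N = 11 samples.
intervals = {p: effective_discrimination_interval(p, 11) for p in (4, 16, 128, 1024)}
# intervals[4] == (Fraction(1, 44), Fraction(1, 8))
```

The lower bound of each interval is also the discrimination 1/(NP) available within that interval.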

The analysis described above can be applied in the model estimation method presented in one embodiment. For example, multiple frequency domain transmission characteristics of a message source under multiple lengths of time domain sample interval are estimated with a multitier model, and it is then determined whether the multiple frequency domain transmission characteristics match multiple predefined spammer templates respectively. If any one of the spammer templates is matched, then the message source is determined to be a spammer. In particular, a spammer that uses a short time period can be captured by using a shorter time domain sample interval with lower frequency domain discrimination, and a spammer that uses a long time period can be captured by using a longer time domain sample interval with higher frequency domain discrimination.

Generally, after a match occurs, sampling on the message source will be stopped. This ensures that spammers that use a short period are captured, and excessive overhead caused by too many samples is avoided. Preferably, the length of a longer time domain sample interval is an integer multiple of that of a shorter time domain sample interval. Advantageously, the number of messages transmitted by the message source in a longer time domain sample interval can then be obtained by summing up the number of messages transmitted by the message source in several shorter time domain sample intervals.
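Obtaining the counts for a longer time domain sample interval from those of a shorter one can be sketched as follows. The function name `aggregate_counts` and the sample counts are hypothetical; the sketch assumes that the longer interval is an integer multiple of the shorter one and simply ignores a trailing incomplete group.

```python
def aggregate_counts(short_counts, factor):
    """Sum consecutive groups of `factor` short-interval counts."""
    usable = len(short_counts) - len(short_counts) % factor
    return [sum(short_counts[i:i + factor]) for i in range(0, usable, factor)]

# Hypothetical counts per interval of length P1 = 4; P2 = 16 is 4 times P1.
p1_counts = [1, 0, 2, 1, 0, 0, 3, 1, 5]
p2_counts = aggregate_counts(p1_counts, 4)
# p2_counts == [4, 4]; the ninth P1 interval does not yet fill a P2 interval
```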

In particular, if four spammer templates need to be established, a total of four time domain sample intervals P1 to P4 will be used. P1 to P4 are each used to perform sampling on the same message source; that is, at every P1, the number of messages that arrived from that message source in that P1 interval is computed. The same is repeated for the other intervals P2, P3 and P4. For each time domain sample interval, the method includes estimating the system transfer function, i.e., the respective model parameters are estimated. The estimated model parameters are compared with those of the corresponding spammer template to determine whether a match exists; and if a match is found for any of the spammer templates, then the message source is determined to be a spammer.

Next, the parameter comparison process according to yet a further embodiment of the invention will be described in detail in conjunction with experimental results. It will be readily recognized by a person skilled in the art that the following method is equally applicable to the other embodiments disclosed herein. A model having the same form as that of a message source is established for a spammer. The model parameter set of a spammer template can be set manually, or it can be obtained by collecting statistics on the frequency domain transmission characteristics of a great number of known spammers.

FIGS. 3(A) to 3(D) show the distribution of model parameters computed for four different time domain sample intervals and for a relatively large number of known spammers, where P1=4, P2=16, P3=128 and P4=1024 are the lengths of the time domain sample intervals and N is fixed to 11, which implies that M is 10, so that four sets of a₁ to a_(M) can be obtained. As disclosed above, although σ² can also be obtained when a₁ to a_(M) are obtained, σ² is primarily used for preliminary filtering and is no longer considered when comparing with a spammer template. Each of the model parameters is located in the interval [−1, 1]. This interval is divided into 32 sub-intervals, and FIGS. 3(A) to 3(D) show the number of message sources whose model parameters fall into each sub-interval.

Reference is now made to FIG. 3(A), which represents a distribution of model parameters in the case where the length of the time domain sample interval is 4. Each column represents one of a₁ to a_(M). The first column on the left represents the distribution of a₁. The first row of the first column on the left is 0, indicating that none of the known spammers has a value of a₁ within the interval [15/16, 1]. The 16^(th) row of the first column on the left is 31, indicating that 31 of the known spammers have a value of a₁ within the interval [0, 1/16]. Based on the distributions in FIGS. 3(A) to 3(D), various methods for collecting statistics, such as a weighted average, can be utilized to compute the respective model parameters a₁ to a_(M) of the four spammer templates corresponding to P1 to P4.

It should be noted that the number of sub-intervals, the order of the model and the number of spammer templates are all illustrative, and those skilled in the art will easily realize that various other suitable settings can be used and that these fall within the scope of the embodiments of the present invention.

After obtaining the model parameters for the spammer template, the model parameters for the message source can be compared with those of the spammer template to determine whether there is a match. According to a match determination method based on distance, the model parameters are considered to span an M-dimensional space, each set of model parameters a₁ to a_(M) being regarded as a point in that space. A distance, such as the Euclidean distance, between the model parameters for the message source and those of the spammer template is computed to determine whether the distance meets a predefined condition. The predefined condition can, for example, be that a match is considered successful if the computed Euclidean distance does not exceed a distance threshold.

Alternatively, a template for a type of message source other than a spammer can be introduced. The Euclidean distance between the model parameters of the message source and those of the spammer template, and the Euclidean distance between the model parameters of the message source and those of a non-spammer template, can be computed. The predefined condition is then that if the former is smaller, the message source matches the spammer template. It should be obvious to a person skilled in the art that there are many other methods of comparing sets of parameters to determine a match, and such methods fall within the scope of the embodiments of the present invention.
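The two distance-based match conditions can be sketched as follows. The function names, parameter values and threshold are hypothetical and chosen only for illustration; `math.dist` requires Python 3.8 or later.

```python
import math

def matches_spammer_template(params, spammer_template, distance_threshold):
    """Match when the Euclidean distance does not exceed the threshold."""
    return math.dist(params, spammer_template) <= distance_threshold

def closer_to_spammer(params, spammer_template, non_spammer_template):
    """Match when the source is nearer the spammer template than the other one."""
    return (math.dist(params, spammer_template)
            < math.dist(params, non_spammer_template))

# Hypothetical two-parameter example (M = 2).
source = [0.5, -0.2]
spammer = [0.4, -0.1]
ordinary = [0.0, 0.0]
by_threshold = matches_spammer_template(source, spammer, 0.2)
by_comparison = closer_to_spammer(source, spammer, ordinary)
```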

FIG. 4 illustrates a detection result of the method for spam message detection according to an embodiment of the invention. Each point in the figure represents a message source; there are approximately 2,000,000 message sources, and the number of messages examined is approximately 40,000,000. P1=4, P2=16, P3=128 and P4=1024 are the lengths of the time domain sample intervals, and M is fixed to 10. The horizontal axis exemplarily represents R(0), and the vertical axis exemplarily represents the variance of the model parameters of the message sources. If R(0) of a message source is greater than an average power threshold, then the message source is considered to be a spammer, and if the variance of the model parameters of the message source is lower than a variance threshold, then the message source is not a spammer. For a message source whose R(0) is less than the average power threshold but whose variance of model parameters is higher than the variance threshold to be a spammer, its model parameters should match those of a spammer template.
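The three-way decision described for FIG. 4 can be sketched as follows. The function name `is_spammer`, its arguments and the threshold values are hypothetical; the order in which the two thresholds are tested, and the treatment of values exactly equal to a threshold, are assumptions made only for this sketch.

```python
def is_spammer(r0, param_variance, template_match,
               power_threshold, variance_threshold):
    """Combine the average power filter, the variance filter and template matching."""
    if r0 > power_threshold:
        return True            # high average power: treated as a spammer
    if param_variance < variance_threshold:
        return False           # smooth spectrum: treated as an ordinary source
    return template_match      # otherwise the spammer template decides

# Hypothetical thresholds and measurements.
high_power = is_spammer(120.0, 0.01, False, 100.0, 0.05)
smooth = is_spammer(10.0, 0.01, True, 100.0, 0.05)
matched = is_spammer(10.0, 0.30, True, 100.0, 0.05)
```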

To verify the correctness of the models, it is manually determined whether each message source whose message transmission speed is larger than a transmission speed threshold is indeed a spammer. It can be seen from FIG. 4 that there are no false negatives among the message sources that have been checked manually, but there are a few false positives. However, since the number of message sources examined is approximately 2,000,000, the false positive rate is low.

Embodiments of the present invention may include several additionalsteps to improve correctness and speed of determination. As mentionedabove, average power threshold and variance threshold can be used forpreliminary filtering. Again, for example, detection based on messagetransmission speed can be further include between step 201 and step 202,the message source whose transmission speed is larger than atransmission speed threshold is regarded as a spammer. Further, a LeakyBucket mechanism can be employed between step 201 and step 202, so thatthe message source whose transmission speed is larger than atransmission threshold and time length reaches a time length thresholdis regarded as a spammer. Transmission speed threshold and time lengththreshold can also be used as a criterion for triggering step 202.

As another example, in step 201, preliminary filtering can be performed based on whether a new session is established. The process enters step 202 only when it is determined that the arrived message has established a new session. In this way, two parties that transmit messages at high speed can be excluded from being identified as spammers.

The specific method is shown in FIG. 5. At step 501, arrival of a message is detected. At step 502, based on the message's sender and receiver, it is determined whether the sender and receiver form a new sender-receiver pair. If so, the process proceeds to step 505; otherwise, it proceeds to step 503.

At step 503, in response to a negative determination on the sender-receiver pair, it is determined whether the interval between the arrival time of the message and that of a previous message corresponding to the same sender-receiver pair exceeds an interval threshold. If so, the process proceeds to step 505; otherwise, it proceeds to step 504.

At step 504, considering that a new session is not established, the process does not enter step 202.

At step 505, considering that a new session is established, the process enters step 202.

Step 503 is optional. It can be considered that a new session is not established as long as it is determined that the message's sender-receiver pair already exists.
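The flow of FIG. 5 can be sketched as follows. The class and method names are hypothetical, the sender-receiver pair is assumed to be ordered, and passing `interval_threshold=None` is an illustrative way of omitting the optional step 503.

```python
class SessionTracker:
    """Illustrative new-session check following steps 501 to 505."""

    def __init__(self, interval_threshold=None):
        # None means the optional step 503 is skipped.
        self.interval_threshold = interval_threshold
        self.last_seen = {}  # (sender, receiver) -> last arrival time

    def is_new_session(self, sender, receiver, arrival_time):
        pair = (sender, receiver)
        previous = self.last_seen.get(pair)
        self.last_seen[pair] = arrival_time
        # Step 502: a pair never seen before starts a new session (step 505).
        if previous is None:
            return True
        # Step 503 (optional): a long gap also starts a new session (step 505).
        if (self.interval_threshold is not None
                and arrival_time - previous > self.interval_threshold):
            return True
        # Step 504: no new session, so step 202 is not entered.
        return False
```

Only messages for which `is_new_session` returns True would be passed on to step 202 in this sketch.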

In the above description, it should be noted that the spammer is described as an entity that transmits messages periodically. However, embodiments of the invention are not limited to detecting a spammer that transmits messages periodically. Even if the spammer's time domain transmission characteristic is made to appear as random transmission through certain means, its frequency domain transmission characteristic still presents a feature that is different from the frequency domain transmission characteristic of an ordinary message source, such that the spammer can be detected by the method disclosed in the embodiments of the present invention.

FIG. 6 is a block diagram of an apparatus for spam message detection according to an embodiment of the invention. The apparatus includes: a collection means configured to collect a time domain transmission characteristic of a message source; a computation means configured to compute a frequency domain transmission characteristic of the message source with the time domain transmission characteristic of the message source; and an identification means configured to determine the message source to be a spammer in response to the frequency domain transmission characteristic of the message source meeting a predefined condition.
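As a rough illustration of what the collection means and the first stage of the computation means might do, the sketch below bins arrival times into sample intervals and computes a biased sample autocorrelation. The function names are hypothetical, and treating R(0) as the mean of the squared per-interval counts is an assumed definition that may differ from the one used elsewhere in this description; the model fitting and the template matching are not shown.

```python
def bin_counts(timestamps, interval, n_bins):
    """Count messages per sample interval of length `interval`."""
    counts = [0] * n_bins
    for t in timestamps:
        idx = int(t // interval)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def autocorrelation(x, max_lag):
    """Biased sample autocorrelation R(0), ..., R(max_lag) of sequence x."""
    n = len(x)
    return [
        sum(x[i] * x[i + k] for i in range(n - k)) / n
        for k in range(max_lag + 1)
    ]
```

For example, `autocorrelation(bin_counts(arrivals, 4, 256), 10)` would yield eleven values for a sample interval of length 4 and a maximum lag of 10, where `arrivals` is a hypothetical list of arrival times for one message source.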

FIG. 7 shows a detailed schematic of a data processing system, hereinafter referred to as a computer system, used to implement the exemplary data flow embodiments illustrated in the previous figures. The computer system 700 includes at least a processor 704. It should be understood that, although FIG. 7 illustrates a single processor, one skilled in the art would appreciate that more than one processor can be included as needed. The processor 704 is connected to a communication infrastructure 702 (for example, a communications bus, cross-over bar, or network), where the communication infrastructure 702 is configured to facilitate communication between various elements of the exemplary computer system 700. Various software embodiments are described in terms of this exemplary computer system. After reading this description, it will become apparent to a person of ordinary skill in the relevant art(s) how to implement the invention using other computer systems and/or computer architectures.

Exemplary computer system 700 can include a display interface 708 configured to forward graphics, text, and other data from the communication infrastructure 702 (or from a frame buffer not shown) for display on a display unit 710. The computer system 700 also includes a main memory 706, which can be random access memory (RAM), and may also include a secondary memory 712. The secondary memory 712 may include, for example, a hard disk drive 714 and/or a removable storage drive 716, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive 716 reads from and/or writes to a removable storage unit 718 in a manner well known to those having ordinary skill in the art. The removable storage unit 718 represents, for example, a floppy disk, magnetic tape, optical disk, etc., which is read by and written to by the removable storage drive 716. As will be appreciated, the removable storage unit 718 includes a computer usable storage medium having stored therein computer software and/or data.

In exemplary embodiments, the secondary memory 712 may include other similar means for allowing computer programs or other instructions to be loaded into the computer system. Such means may include, for example, a removable storage unit 722 and an interface 720. Examples of such may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 722 and interfaces 720 which allow software and data to be transferred from the removable storage unit 722 to the computer system 700.

The computer system 700 may also include a communications interface 724. The communications interface 724 allows software and data to be transferred between the computer system and external devices. Examples of the communications interface 724 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via the communications interface 724 are in the form of signals which may be, for example, electronic, electromagnetic, optical, or other signals capable of being received by the communications interface 724. These signals are provided to the communications interface 724 via a communications path (that is, channel) 726. The channel 726 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, and/or other communications channels.

With reference to the embodiments disclosed, the terms “computer program medium,” “computer usable medium,” and “computer readable medium” are used to generally refer to media such as the main memory 706 and the secondary memory 712, the removable storage drive 716, a hard disk installed in the hard disk drive 714, and signals. These computer program products are means for providing software to the computer system. The computer readable medium allows the computer system to read data, instructions, messages or message packets, and other computer readable information from the computer readable medium. The computer readable medium, for example, may include non-volatile memory, such as Floppy, ROM, Flash memory, Disk drive memory, CD-ROM, and other permanent storage.

The computer readable medium can be used, for example, to transport information, such as data and computer instructions, between computer systems. Furthermore, the computer readable medium may include computer readable information in a transitory state medium such as a network link and/or a network interface, including a wired network or a wireless network that allows a computer to read such computer readable information.

Computer programs (also referred to herein as computer control logic) are stored in the main memory 706 and/or the secondary memory 712. Computer programs may also be received via the communications interface 724. Such computer programs, when executed, can enable the computer system to perform the features of exemplary embodiments of the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 704 to perform the features of the computer system 700. Accordingly, such computer programs represent controllers of the computer system.

It may be appreciated by a person skilled in the art that the above method and system can be implemented by using computer executable instructions and/or included in processor control code, which are provided on a carrier medium such as a disk, CD or DVD-ROM, a programmable memory such as read-only memory, or a data carrier such as an optical or electrical signal carrier. The apparatus/system for spam message detection and its components can be implemented by hardware circuits such as large scale integrated circuits or gate arrays, semiconductors such as point logic chip or transistors, or programmable hardware devices such as field programmable gate arrays or programmable logic devices; or can be implemented by software executed by various types of processors; or can be implemented by a combination of the above hardware circuits and software, such as firmware.

Further, although process steps, method steps or the like may be described in a sequential order, such processes, methods and algorithms may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously, in parallel, or concurrently. Further, some or all steps may be performed in run-time mode.

Computer program means or computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.

The terms “certain embodiments”, “an embodiment”, “embodiment”, “embodiments”, “the embodiment”, “the embodiments”, “one or more embodiments”, “some embodiments”, and “one embodiment” mean one or more (but not all) embodiments unless expressly specified otherwise. The terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless expressly specified otherwise. The enumerated listing of items does not imply that any or all of the items are mutually exclusive, unless expressly specified otherwise. The terms “a”, “an” and “the” mean “one or more”, unless expressly specified otherwise.

Although some exemplary embodiments of the present invention have been illustrated and described, those skilled in the art will appreciate that changes to these embodiments can be made without departing from the principle and spirit of the invention. The scope of the invention is defined by the claims and their equivalent transformations.

We claim:
 1. A method for detecting a spam message, the method comprising: computing a frequency domain transmission characteristic of a message source using a time domain transmission characteristic of the message source; and identifying the message source as being a spammer based on the frequency domain transmission characteristic; wherein the steps of the method are carried out using a computer device.
 2. The method of claim 1, further comprising: identifying the message source as a spammer in response to the frequency domain transmission characteristic of the message source matching a frequency domain transmission characteristic of a spammer template.
 3. The method of claim 1, further comprising: identifying the message source as a spammer in response to the frequency domain transmission characteristic of the message source not matching a frequency domain transmission characteristic of a non-spammer template.
 4. The method of claim 1, further comprising: identifying the message source as a spammer in response to the variance of a parameter of the frequency domain transmission characteristic of the message source being greater than a variance threshold.
 5. The method of claim 2, further comprising: estimating the parameter set of the model corresponding to the message source with the time domain transmission characteristic of the message source; and identifying the message source as a spammer in response to the parameter set of the model corresponding to the message source matching the parameter set of the model corresponding to the spammer template; wherein models having the same form are built for the message source and the spammer template, and the frequency domain transmission characteristic is represented by a parameter set of the model.
 6. The method of claim 2, further comprising: estimating at least two parameter sets of the model corresponding to the message source with the time domain transmission characteristic of the message source by taking at least two different values as length of time domain sample intervals, respectively; and determining that the message source is a spammer in response to any one of the at least two parameter sets of the model corresponding to the message source matching any one of the parameter set of the model corresponding to the first spammer template and the parameter set of the model corresponding to the second spammer template; wherein the spammer template comprises at least a first spammer template and a second spammer template, and models having the same form are built for the message source, the first spammer template and the second spammer template.
 7. The method of claim 6, wherein one of the at least two different values is a positive integer multiple of the other.
 8. The method of claim 1, further comprising: computing an average power of the message source with the time domain transmission characteristic of the message source; identifying the message source as a spammer in response to the average power being greater than an average power threshold; and exiting the process.
 9. The method of claim 1, further comprising: determining that a received message has established a new session in accordance with the time domain transmission characteristic of the message source; and computing the frequency domain transmission characteristic of the message source with the time domain transmission characteristic of the message source in response to the received message having established a new session.
 10. An apparatus for detecting a spam message, the apparatus comprising: computation means configured to compute a frequency domain transmission characteristic of a message source using a time domain transmission characteristic of the message source; and identification means configured to identify said message source as being a spammer based on the frequency domain transmission characteristic.
 11. The apparatus of claim 10, further comprising: means configured to identify the message source as a spammer in response to the frequency domain transmission characteristic of the message source matching a frequency domain transmission characteristic of a spammer template.
 12. The apparatus of claim 10, further comprising: means configured to identify the message source as a spammer in response to the frequency domain transmission characteristic of the message source not matching the frequency domain transmission characteristic of a non-spammer template.
 13. The apparatus of claim 10, further comprising: means configured to identify the message source as a spammer in response to a variance of a parameter of the frequency domain transmission characteristic of the message source being greater than a variance threshold.
 14. The apparatus of claim 11, further comprising: means configured to estimate the parameter set of the model corresponding to the message source with the time domain transmission characteristic of the message source; and means configured to identify the message source as a spammer in response to the parameter set of the model corresponding to the message source matching the parameter set of the model corresponding to the spammer template; wherein models having the same form are built for the message source and the spammer template, and the frequency domain transmission characteristic is represented by a parameter set of the model.
 15. The apparatus of claim 11, further comprising: means configured to estimate at least two parameter sets of the model corresponding to the message source with the time domain transmission characteristic of the message source, taking at least two different values as length of time domain sample intervals, respectively; and means configured to identify the message source as a spammer in response to any one of the at least two parameter sets of the model corresponding to the message source matching any one of the parameter set of the model corresponding to the first spammer template and the parameter set of the model corresponding to the second spammer template; wherein the spammer template comprises at least a first spammer template and a second spammer template, and models having the same form are built for the message source, the first spammer template and the second spammer template.
 16. The apparatus of claim 15, wherein one of the at least two different values is a positive integer multiple of the other.
 17. The apparatus of claim 10, further comprising: means configured to compute average power of the message source with the time domain transmission characteristic of the message source; and means configured to: identify the message source as a spammer in response to the average power being greater than an average power threshold; and exit the flow.
 18. The apparatus of claim 10, further comprising: means configured to identify whether an arrived message has established a new session according to the time domain transmission characteristic of the message source; and means configured to initiate the computation means in response to the arrived message having established a new session.
 19. A computer program product for detecting a spam message, the computer program product comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising computer readable program code configured to execute a method, the method comprising: computing a frequency domain transmission characteristic of a message source using a time domain transmission characteristic of the message source; and identifying said message source as being a spammer based on the frequency domain transmission characteristic.